1. INTRODUCTION

The impact of social media on daily life continues to grow, and that impact can be both good and bad.
Social media platforms such as Twitter and Facebook produce news and information that spread easily
around the globe. Such information is beneficial when it is genuine and well reasoned. However, the number
of people who spread false information to gain particular benefits increases every year, and it has risen
sharply in the past two years [1, 2]. The number of active social media accounts also increases every year,
including accounts that produce hoax information. As a result, people connected to social media have
difficulty determining whether the information they read is genuine or false. The situation worsens as
hoaxes spread over social media networks and are read by more and more people, especially in Indonesia,
the country with the third-largest social media penetration in the world in 2018 [3].

A hoax is false information that is presented as correct and can mislead human perception [4, 5].
Hoaxes are usually spread to persuade or manipulate public opinion, and their spread is often accompanied
by fraud and even threats. In 2016, there were around 800,000 hoax sites producing false information that
was widely distributed over social media such as Twitter and Facebook [6], and hoax news with malicious
political goals is increasingly prevalent [3]. The spread of hoaxes has a very broad impact and can trigger
dangerous horizontal conflicts that threaten the stability of the whole country. Thus, a hoax detection
system is needed to automatically help citizens and the government filter information.

Research on hoax detection systems has been carried out in recent years, for example in
[2, 4, 5, 7-11]. These studies propose classification or learning techniques, which always require
up-to-date training data to maintain detection accuracy. On the other hand, hoax news can also be detected
with a searching technique that uses snippets, as presented in [12-14]. Searching techniques have
the advantage of being more up to date and more practical to use. Therefore, this paper proposes a hoax
news detection technique that combines searching with a classifier method to improve accuracy. A further
drawback of existing hoax detection systems is that they are not equipped with sentiment analysis, so
a sentiment analysis feature is proposed to address this problem. In our system, sentiment analysis is
carried out after hoax news is detected. Sentiment analysis can extract the sentiment hidden in a hoax,
whether positive or negative. This feature helps to reveal the motivation of the hoax, for example whether
or not it serves a black campaign, which is why the sentiment of the response to hoax news needs to be
classified. Methods widely used to classify text and conduct sentiment analysis include Naive Bayes
[15-19], Support Vector Machine [20-22], and k-nearest neighbors (KNN) [23, 24]. In this research,
the Naive Bayes method was chosen to carry out classification and sentiment analysis on hoax news.
As a probabilistic machine learning approach, Naive Bayes tends to work well on training sets that change
over time. It was also chosen because it has been shown to be fast, to produce good accuracy, and to work
well for sentiment analysis with relatively little training data [15, 25, 26].

In several previous text classification and hoax detection studies, the performance of classification
methods was optimized with feature selection methods such as particle swarm optimization (PSO),
information gain (IG) and genetic algorithm (GA) [5, 22, 27, 28]. From these studies, we conclude that
PSO has several advantages over the other methods: it is easy to implement, it can search for optimal
values, and its algorithmic model can be further improved. PSO is also widely employed for classification,
clustering, and text feature selection [29-31]. Based on this analysis of previous research, this paper
proposes an algorithm for a hoax news detection system that combines searching techniques with
optimization and is also equipped with sentiment analysis.


2. RESEARCH METHOD 

We studied several methods from the literature, which led to the conclusion that the searching
technique is more practical than the learning technique for hoax detection. Thus, this paper proposes
a searching technique that classifies hoax news in a more practical and up-to-date manner, because
the crawling process can be run every time a news item is checked. Results stay accurate under frequent
searching because the query can be posed over the web at any time. The classification process uses
the cosine similarity metric. The news items are then further processed by sentiment analysis using
Naïve Bayes, which is optimized by PSO. To keep the focus, sentiment analysis is carried out only for
news detected as a hoax by the searching approach. For sentiment analysis, our approach crawls data from
the social media platforms Twitter and Facebook. The approach is illustrated in Figure 1.


2.1. Hoax detection 

Before hoax detection is carried out, the user enters a query containing the keywords of the news to
be checked. The input query is used to collect news data. Data collection is done by crawling
Indonesian-language news snippets through the search facilities provided by Google via the Google API.
Google Custom Search makes it possible to build a customized search engine, so the web snippet search is
directed to the websites turnbackhoax.id, stophoax.id, operain.blogspot.com, and
ayomajuterus.blogspot.com. Next, the similarity between the search result documents and the input text is
calculated using cosine similarity, which yields the hoax percentage. Cosine similarity (cs) can be
calculated by formula (1) [32].

$$cs(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (1)$$

where A and B are the term vectors of the input text and of a search result document.
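
The similarity step can be sketched as follows. This is a minimal illustration that assumes raw term-frequency vectors built from whitespace tokens; the function name `cosine_similarity` and the example snippet are hypothetical and are not taken from the system described here.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts using raw term-frequency vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Only shared terms contribute to the dot product; all others multiply by zero.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction, so similarity is defined as 0
    return dot / (norm_a * norm_b)


# Hypothetical input query and hypothetical snippet returned by the search step.
query = "menpora imam nahrawi mundur dari jabatannya"
snippet = "hoaks menpora imam nahrawi mundur dari jabatannya"
score = cosine_similarity(query, snippet)
print(f"hoax percentage: {100 * score:.2f}%")
```

In practice the vectors would be built after the same preprocessing that is applied to the crawled text, so that spelling variants and stop words do not lower the score.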

Figure 1. The workflow of the proposed method


2.2. Sentiment analysis 

To find out the sentiment towards a news item, data are searched based on the selected news:
responses are crawled from social media websites such as Facebook and Twitter. The crawled data are
preprocessed to optimize feature extraction and classification results. Preprocessing consists of case
folding, filtering, tokenizing, and stemming [33], whereas the data aggregation technique relies on our
previous work presented in [34-36]. Case folding changes all letters to lowercase. Filtering, often
referred to as stop word removal, deletes words that carry little meaning. Tokenizing breaks the input
text into individual words, and stemming removes affixes so that only the base word remains. From
the preprocessing results, the number of occurrences of each word in each document is counted, and then
TF-IDF is calculated for each word with formula (2) [28].


$$W_i = tf_i \times \log\frac{N}{df_i} \qquad (2)$$


where $W_i$ is the weight of term i, $tf_i$ is the number of occurrences of term i, $df_i$ is the number
of documents containing term i, and N is the total number of documents. Once the term weights are
obtained, they are used as the reference for the PSO particles. The first step in PSO is to input
the population size. Each population initializes particles that represent the features (words), with
position set to a random number between 0 and 1 and velocity set to 0. The particles are then sorted by
position value, from highest to lowest.
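
The preprocessing and term weighting steps can be sketched as follows. This is an illustration only: it assumes formula (2) is the usual tf × log(N/df) weighting with a natural logarithm, it omits stemming, and the stop-word list, function names, and example sentences are hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a full Indonesian list.
STOP_WORDS = {"yang", "di", "dan", "dari", "ini", "itu"}


def preprocess(text: str) -> list[str]:
    """Case folding, tokenizing and stop-word filtering (stemming is omitted here)."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]


def tfidf_weights(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute W_i = tf_i * log(N / df_i) for every term of every document."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # each document counts at most once per term
    weights = []
    for tokens in documents:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical crawled responses.
docs = [preprocess(t) for t in [
    "Berita ini hoax dan fitnah",
    "Berita itu benar",
    "Fitnah yang keji",
]]
weights = tfidf_weights(docs)
print(weights[0])
```

A term that occurs in every document receives weight 0 under this formula, which is why such terms tend to fall to the bottom of any ranking based on these weights.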

Next, the NB categorization is calculated with the reduced feature set based on the highest particle
position values. A term with a low value is not used for classification; that is, particles are restricted
to a certain rank. For example, if there are 32 particles and the limit is the 20 particles with
the highest values, then the particles ranked 21 to 32 are not used. Next, the probability is calculated
using formula (3).

$$P(A_i \mid C_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left[-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right] \qquad (3)$$

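
As an illustration of this probability calculation, the sketch below assumes that formula (3) is the usual Gaussian likelihood with a per-class mean and variance for each feature, and that classes are compared by their log posterior. The function names, class statistics, and priors are hypothetical values chosen only for the example.

```python
import math


def gaussian_likelihood(value: float, mean: float, variance: float) -> float:
    """Gaussian density of one feature value for one class (variance must be > 0)."""
    exponent = -((value - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)


def gaussian_log_likelihood(value: float, mean: float, variance: float) -> float:
    """Logarithm of the density above, computed directly to avoid underflow."""
    return -0.5 * math.log(2.0 * math.pi * variance) - ((value - mean) ** 2) / (2.0 * variance)


def predict(features, class_stats, priors):
    """Return the class with the highest log posterior.

    class_stats maps each class to a list of (mean, variance) pairs, one per feature.
    """
    best_label, best_score = None, -math.inf
    for label, stats in class_stats.items():
        score = math.log(priors[label])
        for value, (mean, variance) in zip(features, stats):
            score += gaussian_log_likelihood(value, mean, variance)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical per-class statistics for two features.
class_stats = {
    "positive": [(0.8, 0.04), (0.1, 0.01)],
    "negative": [(0.1, 0.01), (0.7, 0.04)],
    "neutral": [(0.4, 0.09), (0.4, 0.09)],
}
priors = {"positive": 1 / 3, "negative": 1 / 3, "neutral": 1 / 3}
label = predict([0.75, 0.12], class_stats, priors)
print(label)
```

Summing log likelihoods instead of multiplying densities is a common design choice because products of many small probabilities underflow quickly.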
The probability results assign a category or class to each document. This calculation is repeated on
all documents so that the accuracy can be calculated in the next step. The Naive Bayes accuracy of each
population is calculated by formula (4).


$$\text{accuracy} = \frac{\text{number of correctly classified documents}}{\text{total number of documents}} \qquad (4)$$
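
The rank-based feature restriction and the accuracy of formula (4) can be sketched as follows. The function names and the example data are hypothetical; the 32-to-20 restriction mirrors the example given in the text.

```python
def select_top_features(positions: dict[str, float], keep: int) -> list[str]:
    """Keep the `keep` terms whose particle position is highest."""
    ranked = sorted(positions, key=positions.get, reverse=True)
    return ranked[:keep]


def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Formula (4): correctly classified documents divided by all documents."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# 32 hypothetical particles restricted to the 20 with the highest position.
positions = {f"term{i}": i / 32 for i in range(32)}
kept = select_top_features(positions, keep=20)

# Hypothetical predictions against hypothetical true labels.
score = accuracy(
    ["positive", "negative", "neutral", "negative"],
    ["positive", "negative", "positive", "negative"],
)
print(len(kept), score)
```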


Then check whether the Naïve Bayes accuracy is better than the personal best (Pbest) and the global
best (Gbest) accuracy. If the accuracy of Naïve Bayes in the current population is better than Pbest and
Gbest, then the current population becomes the new Pbest and Gbest. The particle velocity is updated using
formula (5) and the particle position using formula (6). The flow of the process at this stage is shown
in Figure 2.

Figure 2. Detailed process of the NB classifier with PSO optimization for hoax detection

These steps, including the update of Pbest and Gbest, are repeated until the iterations are complete
and produce a model with the Gbest accuracy. When the iterations are complete, the populations are ordered
from highest to lowest Naive Bayes accuracy, and the Gbest value is used as the feature model for
sentiment analysis. A Naive Bayes probability calculation is then performed for three classes, namely
positive, negative and neutral, and each document is assigned the class with the highest probability
value.
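
The iteration over Pbest and Gbest can be sketched as follows. The velocity and position updates are assumed to take the standard inertia-weight PSO form, which may differ from formulas (5) and (6); the parameter values, the clamping of positions to [0, 1], and the toy fitness function are also assumptions. In the system described here, the fitness would be the Naive Bayes accuracy obtained with the highest-ranked features.

```python
import random


def pso_feature_search(n_features, fitness, n_particles=10, n_iterations=20,
                       inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard inertia-weight PSO that maximizes `fitness` over positions in [0, 1]."""
    rng = random.Random(seed)
    # Positions start as random numbers in [0, 1] and velocities start at 0.
    positions = [[rng.random() for _ in range(n_features)] for _ in range(n_particles)]
    velocities = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    pbest_score = [fitness(p) for p in positions]
    best = max(range(n_particles), key=pbest_score.__getitem__)
    gbest, gbest_score = pbest[best][:], pbest_score[best]

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                                    + c2 * r2 * (gbest[d] - positions[i][d]))
                # Clamp so positions stay in the range used at initialization.
                positions[i][d] = min(1.0, max(0.0, positions[i][d] + velocities[i][d]))
            score = fitness(positions[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = positions[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = positions[i][:], score
    return gbest, gbest_score


def toy_fitness(position):
    """Hypothetical stand-in for classifier accuracy: favors the first two features."""
    return position[0] + position[1] - sum(position[2:])


gbest, gbest_score = pso_feature_search(n_features=5, fitness=toy_fitness)
print([round(x, 3) for x in gbest], round(gbest_score, 3))
```

Because Gbest is replaced only when a strictly better score is found, the returned score never decreases from one iteration to the next.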


3. RESULTS AND DISCUSSION 

This research used hoax news data that were widely disseminated through social media and collected
by crawling. From the data found, 30 hoax news samples were taken; samples are shown in Table 1.
Cosine similarity was calculated for all of the news data, and the resulting values are shown in Table 2.


Table 1. Sample of Indonesian news and its category

No   Title
1    Web KPU Diretas, Temuan Mengejutan!! Jokowi Angkat Isu PKI
2    Bocoran Informasi Penting Valid Pola Kecurangan Sistem Penghitungan Suara KPU Dengan Modus Nomor 01 dan 02
3    Kertas suara Pemilu dibakar seperti sampah, kecurangan ini mau didiamkan karena dilindungi oleh aparat dan pejabat?
4    Menpora Imam Nahrawi Mundur Dari Jabatannya
5    Gambar Rancangan Gedung Istana Negara di Palangkaraya
6    Megawati Soekarnoputri Dirawat di Rumah Sakit karena Stroke
30   simpatisan pki bacok seorang ulama di daerah banten


Table 2. Cosine similarity results

No   Cosine similarity (%)   No   Cosine similarity (%)   No   Cosine similarity (%)
1    89.5669                 11   74.8455                 21   66.2266
2    66.7424                 12   72.6273                 22   86.0663
3    88.6405                 13   90.8688                 23   84.6327
4    67.8844                 14   71.9092                 24   68.1385
5    75.0587                 15   71.4435                 25   76.7366
6    91.6342                 16   81.1107                 26   67.3435
7    66.9439                 17   75.3778                 27   76.3323
8    67.1937                 18   77.4070                 28   85.5236
9    73.7210                 19   74.7265                 29   77.6899
10   86.7227                 20   69.2308                 30   89.9647





For the 30 samples above, the average cosine similarity is about 77.08%. This suggests that
the method can identify hoax news well: the highest cs value is 91.6342% and the lowest is 66.2266%, and
although the average is not high, all calculations lead to the correct classification. In the next step,
sentiment analysis of the news was carried out using the Naive Bayes and PSO methods proposed above.
Sentiment was divided into three categories, namely positive, negative and neutral. Table 3 shows
the results of the sentiment analysis of the proposed method.
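
The summary statistics quoted for Table 2 can be rechecked with a few lines of Python. The values below are copied from Table 2; the variable names are arbitrary.

```python
from statistics import mean

# Cosine similarity percentages for news items 1 to 30, copied from Table 2.
cosine_scores = [
    89.5669, 66.7424, 88.6405, 67.8844, 75.0587,
    91.6342, 66.9439, 67.1937, 73.7210, 86.7227,
    74.8455, 72.6273, 90.8688, 71.9092, 71.4435,
    81.1107, 75.3778, 77.4070, 74.7265, 69.2308,
    66.2266, 86.0663, 84.6327, 68.1385, 76.7366,
    67.3435, 76.3323, 85.5236, 77.6899, 89.9647,
]

average = mean(cosine_scores)
highest = max(cosine_scores)
lowest = min(cosine_scores)
print(f"mean={average:.2f}%  max={highest:.4f}%  min={lowest:.4f}%")
```

The mean comes out at roughly 77.08%, with the maximum at item 6 and the minimum at item 21.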

From these results, it can be concluded that 19 of the 30 sentiment classifications are correct.
Although this accuracy is not very high, the proposed sentiment analysis is still more accurate and
faster than other methods such as KNN. On the same computer specification, the NB method takes an average
of 0.4733 seconds per document, whereas KNN takes an average of 6.213 seconds per document. Table 4
compares the sentiment classification results of KNN, Naive Bayes, KNN + PSO, and Naive Bayes + PSO.





Hoax classification and sentiment analysis of Indonesian news using naive... (Heru Agus Santoso) 




Table 3. Sentiment analysis results 

Table 4. Comparison of the sentiment analysis results of each method

No   Sentiment class   KNN        Naive Bayes   KNN + PSO   Naive Bayes + PSO (Proposed Method)
1    negative          positive   negative      negative    negative
2    negative          neutral    negative      neutral     negative
3    neutral           negative   negative      negative    positive
4    positive          positive   positive      positive    positive
5    neutral           positive   negative      neutral     neutral
6    positive          positive   positive      positive    positive
7    negative          neutral    negative      neutral     negative
8    negative          neutral    negative      neutral     negative
9    negative          neutral    negative      neutral     negative
10   positive          positive   positive      positive    positive
11   neutral           neutral    positive      neutral     positive
12   negative          negative   negative      negative    negative
13   negative          positive   negative      positive    negative
14   neutral           negative   negative      negative    negative
15   positive          neutral    neutral       neutral     neutral
16   neutral           neutral    positive      neutral     positive
17   neutral           neutral    neutral       neutral     neutral
18   positive          negative   negative      negative    negative
19   positive          positive   positive      positive    positive
20   positive          positive   positive      positive    positive
21   positive          positive   negative      negative    negative
22   positive          positive   positive      positive    positive
23   negative          positive   negative      negative    negative
24   negative          negative   neutral       negative    neutral
25   negative          neutral    positive      neutral     positive
26   neutral           neutral    neutral       neutral     neutral
27   positive          positive   positive      positive    positive
28   negative          positive   positive      positive    positive
29   negative          negative   positive      negative    positive
30   negative          neutral    negative      negative    negative

Number of correct classifications: KNN 17, Naive Bayes 18, KNN + PSO 18, Naive Bayes + PSO 19

4. CONCLUSION

In this research, a hoax detection method based on searching has been proposed. Users enter queries
to search for news that is suspected to be a hoax. Once the news title is obtained, classification is
done by searching with Google Custom Search and snippets, and the results are scored with the cosine
similarity method. In a test on 30 news items, the average hoax percentage is 77%, and all of the news
items are detected as hoaxes, with a minimum percentage of about 66% and a maximum of about 91%.
This shows that the performance of the proposed method is reliable enough to detect hoax news. The system
is also equipped with a sentiment analysis process using Naive Bayes optimized by the PSO method. Based on
the test results, the proposed sentiment analysis method works better than the other methods compared.